begin
using PlutoUI
using StatsPlots
using LinearAlgebra
using Distributions
using ForwardDiff
using MCMCChains
using DataFrames
using Random
using LaTeXStrings
using HypertextLiteral
using Logging; Logging.disable_logging(Logging.Info);
end;
TableOfContents()
At first glance, it might be hard to see why Bayesian computation can be difficult. After all, Bayes' theorem provides a very straightforward recipe to compute the posterior:
$$\text{posterior}=\frac{\text{prior} \times \text{likelihood}}{\text{evidence}}\;\;\text{or}\;\;p(\theta|\mathcal D) =\frac{p(\theta) p(\mathcal D|\theta)}{p(\mathcal D)},$$
where
$$p(\mathcal D) = \int p(\theta) p(\mathcal D|\theta) \mathrm{d}\theta,$$
known as evidence or marginal likelihood, is a constant with respect to the model parameter $\theta$.
The recipe is easy to write down, but the daunting bit lies in computing the normalising constant $p(\mathcal D)$: the integration is often high-dimensional and therefore usually intractable.
As a result, posterior probability calculations, such as the probability that $\theta \in A$,
$$\mathbb{P}(\theta \in A|\mathcal D) = \int_{\theta \in A} p(\theta |\mathcal D) \mathrm{d}\theta = \frac{ \int_{\theta \in A} p(\theta) p(\mathcal D|\theta) \mathrm{d}\theta}{p(\mathcal D)},$$
cannot be evaluated exactly. For example, in the coin-tossing example introduced last time, if we consider any coin with a bias of $0.5 \pm 0.05$ fair, the posterior probability we want to know is
$$\mathbb{P}(0.45 \leq \theta \leq 0.55|\mathcal D).$$
We sidestepped the integration problem last time by discretising the parameter space $\Theta = [0,1]$ into some finite discrete choices. The method has essentially replaced a difficult integration with a brute-force enumeration summation
$$\mathbb{P}(0.45 \leq \theta \leq 0.55|\mathcal D) = \frac{\int_{0.45}^{0.55} p(\theta)p(\mathcal D |\theta) \mathrm{d}\theta}{p(\mathcal D)}\approx \frac{\sum_{\theta_0'\in \{0.5\}} p(\theta=\theta_0')p(\mathcal D|\theta=\theta_0')}{\sum_{\theta_0\in \{0, 0.1, \ldots, 1.0\}} p(\theta=\theta_0)p(\mathcal D|\theta=\theta_0)}.$$
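The grid approximation above can be sketched in a few lines of Julia. This is an illustrative sketch only, assuming hypothetical data of 7 heads out of 10 tosses and a uniform prior over the grid:

```julia
# Grid (discretisation) approximation of the coin-bias posterior.
# Hypothetical data: 7 heads out of 10 tosses; uniform prior over the grid.
grid = 0:0.1:1.0                                # discretised Θ = [0, 1]
prior = fill(1 / length(grid), length(grid))    # p(θ = θ₀)
likelihood = [θ^7 * (1 - θ)^3 for θ in grid]    # p(D|θ = θ₀), up to a binomial constant
joint = prior .* likelihood                     # numerator terms
posterior = joint ./ sum(joint)                 # normalise by the brute-force sum
# On this grid, only θ = 0.5 falls inside [0.45, 0.55]
prob_fair = sum(posterior[0.45 .<= grid .<= 0.55])
```

Note how the denominator is replaced by a plain sum over the 11 grid points, exactly the enumeration in the formula above.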
Unfortunately, this brute-force discretisation-based method is not scalable. When the dimension of the parameter space grows, the algorithm becomes too slow to use. To see this, consider discretising a $D$-dimensional parameter space $\Theta \subseteq \mathbb{R}^D$: if each parameter is discretised with just 2 choices (a very crude discretisation), the total discretised space is of order $2^D$, which grows exponentially. Such an exponentially growing size soon becomes too large for any modern computer to handle.
What's worse, the difficulty does not end here. Even if we knew the normalising constant, i.e. $p(\mathcal D)$ were known and the posterior could be evaluated exactly, we would still need to evaluate the numerator's integration, $\int_{\theta \in A} p(\theta)p(\mathcal D|\theta) \mathrm{d}\theta$, which is again generally intractable.
To summarise, the difficulty of Bayesian computation is two-fold:
the posterior distribution is only evaluable up to some unknown normalising constant;
posterior summary involves integrations, which are intractable.
Recall that, to summarise a posterior, we need to calculate intractable integrations such as
$$\mathbb{P}(\theta \in A|\mathcal D) = \int_{\theta \in A} p(\theta |\mathcal D) \mathrm{d}\theta = \frac{ \int_{\theta \in A} p(\theta) p(\mathcal D|\theta) \mathrm{d}\theta}{p(\mathcal D)}.$$
More generally, we want to calculate expectations of any functions of random variable $\theta$ under the posterior:
$$\mathbb E[t(\theta)|\mathcal D] = \int t(\theta) p(\theta|\mathcal D) \mathrm{d}\theta = \frac{\int t(\theta) p(\theta) p(\mathcal D|\theta) \mathrm{d}\theta}{p(\mathcal D)} \tag{1}$$
Note that when $t(\cdot)$ is a counting function, e.g.
$$t(\theta) = \mathbf 1(\theta \in A)=\begin{cases} 1 & \theta \in A \\ 0 & \theta \notin A,\end{cases}$$
we recover the first question as
$$\mathbb E[t(\theta)|\mathcal D] = \int \mathbf{1}(\theta\in A) p(\theta|\mathcal D) \mathrm{d}\theta = \int_{\theta \in A} 1\cdot p(\theta|\mathcal D) \mathrm{d}\theta = \mathbb P(\theta\in A|\mathcal D).$$
That's why the expectation problem is more "general".
Therefore, the problem we want to tackle is:
Problem 1: How to estimate expectations of functions of $\theta$, such as equation (1), under the posterior?
Monte Carlo methods are widely known for their capability to approximate intractable integrations. Suppose we can sample from the posterior distribution,
$$\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(R)} \sim p(\theta|\mathcal D),$$
then if the sample size $R$ is large enough, by the law of large numbers, we can approximate the integration by frequency counting:
$${\mathbb{P}}(\theta \in A|\mathcal D) \approx \frac{\#\{\theta^{(r)} \in A\}}{R},$$
where $\#\{\cdot\}$ counts how many samples fall in the set $A$.
And general expectations of the form
$$\mathbb E[t(\theta)|\mathcal D] = \int t(\theta) p(\theta|\mathcal D) \mathrm{d}\theta = \frac{\int t(\theta) p(\theta) p(\mathcal D|\theta) \mathrm{d}\theta}{p(\mathcal D)}$$
can also be approximated by its Monte Carlo estimation
$${\mathbb{E}}[t(\theta)|\mathcal D] \approx \hat t =\frac{1}{R} \sum_{r} t(\theta^{(r)}),$$
which is the sample average evaluated at the drawn samples.
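As a quick sanity check of both estimators, the sketch below uses a known Normal distribution as a stand-in for the posterior (an assumption purely for illustration, so the exact answers are available for comparison):

```julia
using Distributions, Random, Statistics

Random.seed!(123)
post = Normal(1.0, 2.0)   # stand-in for p(θ|D); chosen so the truth is known
R = 100_000
θs = rand(post, R)        # θ⁽¹⁾, …, θ⁽ᴿ⁾ ~ p(θ|D)

mc_mean = mean(θs)                        # Monte Carlo estimate of E[θ|D] (truth: 1.0)
mc_prob = mean(0 .<= θs .<= 2)            # frequency counting for P(0 ≤ θ ≤ 2 | D)
true_prob = cdf(post, 2) - cdf(post, 0)   # exact probability for comparison
```

With $R = 100{,}000$ samples, both estimates land within a fraction of a percent of the exact values.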
The following diagram illustrates the idea of Monte Carlo approximation. The true posterior distribution $p(\theta|\mathcal D)$ is plotted on the left, and the histogram of $R=2000$ random samples from the posterior is plotted on the right. The Monte Carlo method essentially uses the histogram on the right as a stand-in for the true posterior. For example, to calculate the mean or expectation of $\theta$, i.e. $t(\theta) = \theta$, the Monte Carlo estimator becomes
$$\mathbb E[\theta|\mathcal D] \approx \frac{1}{R} \sum_r \theta^{(r)},$$
i.e. the average of the drawn samples (which is very close to the ground truth on the left).
begin
approx_samples = 2000
dist = MixtureModel([Normal(2, sqrt(2)), Normal(9, sqrt(19))], [0.3, 0.7])
Random.seed!(100)
x = rand(dist, approx_samples)
ids = (x .< 15) .& (x .> 0)
prob_mc=sum(ids)/approx_samples
end;
let
plt = Plots.plot(
dist;
xlabel=raw"$\theta$",
ylabel=raw"$p(\theta|\mathcal{D})$",
title="true posterior",
fill=true,
alpha=0.3,
xlims=(-10, 25),
label="",
components=false,
)
Plots.vline!(plt, [mean(dist)]; label="true mean", linewidth=3)
plt2 = Plots.plot(
dist;
xlabel=raw"$\theta$",
fill=false,
alpha=1,
linewidth =3,
xlims=(-10, 25),
label="",
components=false,
)
histogram!(plt2, x, bins=110, xlims=(-10, 25), norm=true, label="", color=1, xlabel=raw"$\theta$", title="Monte Carlo Approx.")
Plots.vline!(plt2, [mean(x)]; label="MC mean", linewidth=3, color=2)
Plots.plot(plt, plt2)
end
Similarly, calculating probability, such as $\mathbb P(0\leq \theta \leq 15)$, reduces to frequency counting:
$$\hat{\mathbb{P}}(0\leq \theta \leq 15) = \frac{\#\{0\leq \theta^{(r)}\leq 15\}}{2000} = 0.905,$$
counting the proportion of samples that falls in the area of interest. The idea is illustrated in the following diagram.
let
plt2 = Plots.plot(
dist;
xlabel=raw"$\theta$",
fill=false,
alpha=1,
linewidth =3,
xlims=(-10, 25),
label="",
components=false,
size=(300,450)
)
histogram!(plt2, x, bins = -10:0.3:25 , xlims=(-10, 25), norm=false, label="", color=1, xlabel=raw"$\theta$", title="Monte Carlo est.")
Plots.plot!(plt2, 0:0.5:15,
dist;
xlabel=raw"$\theta$",
fill=true,
color = :orange,
alpha=0.5,
linewidth =3,
xlims=(-10, 25),
label=L"0 \leq θ \leq 15",
components=false,
)
histogram!(plt2, x[ids], bins = -10:0.3:25, xlims=(-10, 25), norm=false, label="", color=:orange, xlabel=raw"$\theta$")
end
The most important property of the Monte Carlo method is its scalability, which makes it a practical solution to Problem 1 even when $\theta \in \mathbb{R}^D$ is high-dimensional.
The accuracy of the Monte Carlo estimate does not depend on the dimensionality of the space sampled, $D$. Roughly speaking, regardless of the dimensionality of $\theta$, the accuracy (squared error from the mean) remains the same.
Further details about the scalability*
For simplicity, we assume $t$ is a scalar-valued function. Note all expectations here are w.r.t the posterior distribution from which the samples are drawn. Firstly, it can be shown that the Monte Carlo estimator is unbiased:
$$\mathbb E[\hat t] = \mathbb E\left [\frac{1}{R}\sum_r t(\theta^{(r)})\right ] =\frac{1}{R}\sum_r \mathbb E[t(\theta^{(r)})] = \mathbb E[t(\theta)].$$
It means, on average, the estimator converges to the true integration value.
To measure the estimator's accuracy, we only need to find the estimator's variance as it measures the average squared error between the estimator and true value:
$$\mathbb V[\hat t] = \mathbb E[(\hat t -\mathbb E[t])^2].$$
If we assume samples are independent draws from the distribution, the variance is then
$$\mathbb V[\hat t] =\mathbb V\left [\frac{1}{R}\sum_r t(\theta^{(r)})\right ]= \frac{1}{R^2} \sum_r \mathbb{V}[t(\theta^{(r)})]=\frac{R\cdot \mathbb{V}[t(\theta^{(r)})]}{R^2}=\frac{\sigma^2}{R},$$
where
$$\sigma^2 = \mathbb V[t] = \int p(\theta|\mathcal D) (t(\theta)-\mathbb E[t])^2\mathrm{d}\theta$$
is some positive constant that only depends on the function $t$. Therefore, as the number of samples $R$ increases, the variance of $\hat t$ decreases linearly in $1/R$ (the standard deviation, $\sqrt{\mathbb V[\hat t]}$, unfortunately, shrinks only at a rate of $1/\sqrt{R}$). Note the accuracy (variance) only depends on $\sigma^2$, the variance of the particular statistic $t$, rather than on $D$, the dimensionality.
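The $\sigma^2/R$ rate can be checked empirically. The sketch below is illustrative: it uses standard Normal draws, so $\sigma^2 = 1$ and the true mean is 0, and estimates the mean-squared error of the Monte Carlo mean estimator at two sample sizes:

```julia
using Statistics, Random

Random.seed!(321)
# Empirical MSE of the MC estimator t̂ = (1/R) Σᵣ θ⁽ʳ⁾, averaged over many repeated trials.
mse(R; trials = 5_000) = mean(abs2, [mean(randn(R)) for _ in 1:trials])

mse_100 = mse(100)      # ≈ σ²/R = 1/100
mse_1000 = mse(1_000)   # ten times more samples ⇒ roughly ten times smaller squared error
```

The ratio `mse_100 / mse_1000` comes out close to 10, matching the $\sigma^2/R$ prediction.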
In the previous section, we have established that Monte Carlo estimation is a scalable method if we can obtain samples from a posterior distribution. That is a big "if" to assume. Without a practical sampling method, no Monte Carlo estimators can be calculated.
We should also note that for a general Bayesian inference problem the posterior can only be evaluated up to some unknown constant:
$$p(\theta|\mathcal D) \propto p(\theta) p(\mathcal D|\theta),$$
where the scalar $1/p(\mathcal D)$ involves a nasty integration which we do not know how to compute.
The question we face now is a sample generation problem:
Problem 2: how to generate samples $\{\theta^{(r)}\}_{r=1}^R$ from an unnormalised distribution $p(\theta|\mathcal D) \propto p(\theta)p(\mathcal D|\theta)$?
Note. If we take the log of the posterior, the log density becomes a sum of the log prior and the log-likelihood: $\ln p(\theta|\mathcal D) = \ln p(\theta) + \ln p(\mathcal D|\theta) + \text{const.},$ which is faster and more numerically stable for computers to evaluate. Additions are in general faster than multiplications/divisions in floating-point arithmetic.
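As a concrete instance of this log trick, an unnormalised log posterior for a coin model might be sketched as follows (the Beta(1,1) prior and the data of 7 heads out of 10 tosses are hypothetical, chosen for illustration):

```julia
using Distributions

# Unnormalised log posterior: ln p(θ|D) = ln p(θ) + ln p(D|θ) + const.
heads, tosses = 7, 10                             # hypothetical data
ℓprior(θ) = logpdf(Beta(1, 1), θ)                 # ln p(θ)
ℓlike(θ) = logpdf(Binomial(tosses, θ), heads)     # ln p(D|θ)
ℓpost(θ) = ℓprior(θ) + ℓlike(θ)                   # a sum: fast and numerically stable
```

Working in log space also avoids underflow when the likelihood is a product of many small terms.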
A class of methods called Markov Chain Monte Carlo (MCMC) is a popular and successful way to generate samples from such non-standard distributions. The acronym MCMC combines two concepts:
Markov Chain: $p(\theta^{(r)}|\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(r-1)}) = p(\theta^{(r)}|\theta^{(r-1)})$
the current state only depends on the previous state given the whole history
samples are generated by some carefully crafted Markov Chains
current sample depends on the previous sample
Monte Carlo: estimation by the Markov chain samples as illustrated in the previous section
The idea is to produce posterior samples $\{\theta^{(r)}\}_{r=1}^R$ in sequence, each depending only on $\theta^{(r-1)}$ and not on its more distant predecessors, i.e. a Markov chain (which accounts for the first "MC" of the acronym MCMC). When the chain's transition probability satisfies certain conditions, Markov chain theory tells us that the empirical distribution of the simulated samples approaches the desired target distribution as the chain is simulated long enough, i.e. when $R$ is large.
Since the sequence converges to the target distribution only when $R$ is large, we usually retain just the last chunk of the sequence as "good samples", or equivalently discard the initial samples as burn-in, since they are usually not representative of the distribution to be approximated. For example, if the Markov chain has been simulated for 4,000 steps, we only keep the last 2,000 to form an empirical approximation of the target.
The Metropolis-Hastings (MH) algorithm is one of the MCMC methods (and arguably the most important one). MH constructs a Markov chain in two simple steps:
it proposes a candidate $\theta_{\text{proposed}}$ based on the current state $\theta_{\text{current}}$ according to a proposal distribution: $q(\theta_{\text{proposed}}|\theta_{\text{current}})$
it then either moves to $\theta_{\text{proposed}}$ (accepts the proposal) or stays put (rejects it), based on the proposal's "quality", which involves the ratio:
$$\frac{p(\theta_{\text{proposed}}|\mathcal D)}{p(\theta_{\text{current}}|\mathcal D)}^\ast$$
intuitively, if the proposal state is good, or the ratio is well above 1, we move to $\theta_{\text{proposed}}$, otherwise stay put
$\ast$ note the acceptance probability ratio also involves another ratio of the proposal distributions (check the algorithm below for details)
A key observation here is that MH's operations do not involve the normalising constant:
$$\frac{p(\theta_{\text{proposed}}|\mathcal D)}{p(\theta_{\text{current}}|\mathcal D)} = \frac{\frac{p(\theta_{\text{proposed}})p(\mathcal D|\theta_{\text{proposed}})}{\cancel{p(\mathcal D)}}}{\frac{p(\theta_{\text{current}})p(\mathcal D|\theta_{\text{current}})}{\cancel{p(\mathcal D)}}} = \frac{p(\theta_{\text{proposed}})p(\mathcal D|\theta_{\text{proposed}})}{p(\theta_{\text{current}})p(\mathcal D|\theta_{\text{current}})} \triangleq \frac{p^\ast(\theta_{\text{proposed}})}{p^\ast(\theta_{\text{current}})}.$$
Therefore, we only need to evaluate the un-normalised posterior distribution $p^\ast$.
The algorithm details are summarised below.
Initialise $\theta^{(0)}$ arbitrarily
For $t = 0,1,2,\ldots$:
sample a candidate value from $q$:
$$\theta' \sim q(\theta|\theta^{(t)})$$
evaluate $a$, where
$$a = \text{min}\left \{\frac{p^\ast(\theta')q(\theta^{(t)}|\theta')}{p^\ast(\theta^{(t)}) q(\theta'|\theta^{(t)})}, 1\right \}$$
set
$$\theta^{(t+1)} = \begin{cases} \theta' & \text{with probability } a\\ \theta^{(t)} & \text{with probability } 1-a\end{cases}$$
Remark: when the proposal distribution is symmetric, i.e.
$$q(\theta^{(t)}|\theta')= q(\theta'|\theta^{(t)}),$$
the acceptance probability reduces to
$$a = \text{min}\left \{\frac{p^\ast(\theta')}{p^\ast(\theta^{(t)}) }, 1\right \}$$
and the algorithm is called the Metropolis algorithm. A common symmetric proposal is a Gaussian random walk, i.e. $q(\theta'|\theta^{(t)})= \mathcal N(\theta^{(t)}, \Sigma),$ where $\Sigma$ is some fixed covariance matrix.
We want to obtain samples from a completely biased die that lands on three (⚂) all the time. Suppose one uses the Metropolis-Hastings algorithm [1] to generate samples from the biased die, with a fair die as the proposal distribution of the MH algorithm.
Is the proposal symmetric?
What would the MCMC chains look like?
Will the chain ever converge to the target distribution? And do we need to discard any as burn-in?
Depending on the initial state, the chain can either be $3,3,3,\ldots$ or $x,x,\ldots,x,3,3,3,\ldots$ where $x \in \{1,2,4,5,6\}$.
The proposal is symmetric. Since a fair die is used as the proposal, the proposal probability distribution is 1/6 for all 6 facets.
As a result, the acceptance probability $a$ only depends on the ratio of the target distribution. There are two possible scenarios:
the chain starts at 3, then all following proposals other than 3 will be rejected (as $a=\tfrac{0}{1}=0\%$). Therefore, only samples of 3 will be produced.
the chain is initialised with a state other than 3, e.g. $x=1$, then all proposals other than 3 will be rejected [2], so the initial samples will be all ones ($1,1,1,\ldots$); until a three is proposed, which is accepted with probability $a=\min\{1/0,\, 1\} = 1$, i.e. a 100% chance; only threes will be produced onwards (by the same argument as case 1).
Regardless of the starting state, the chain will eventually converge and produce the correct samples, i.e. $3,3,3,3,\ldots$. For chains starting with a state other than 3, the initial chunk (all ones in the previous case) should be discarded as burn-in: the chain has not yet converged.
Luckily, the discarded initial chunk will be short. On average, we should expect the burn-in to last only 6 iterations, as its length is a geometric random variable with $p=1/6$.
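The die example can be simulated directly. The sketch below is an illustrative implementation (the helper names are mine, not from the lecture); note that Julia's floating-point conventions conveniently handle both edge cases of the acceptance ratio:

```julia
using Random

Random.seed!(42)
p⋆(x) = x == 3 ? 1.0 : 0.0   # unnormalised target: all mass on ⚂

function mh_die(x0, iters)
    chain = fill(x0, iters)
    for t in 2:iters
        proposal = rand(1:6)                  # symmetric fair-die proposal
        # 1/0 = Inf ⇒ always accept; 0/0 = NaN ⇒ rand() < NaN is false ⇒ reject
        a = p⋆(proposal) / p⋆(chain[t - 1])
        chain[t] = rand() < min(a, 1) ? proposal : chain[t - 1]
    end
    return chain
end

chain = mh_die(1, 200)   # starts at 1: a short burn-in of ones, then all threes
```

Running from the initial state 1 reproduces the behaviour described above: a run of ones until the first three is proposed, and only threes thereafter.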
To demonstrate the idea, consider sampling from a bivariate Gaussian distribution:
$$\begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix} \sim \mathcal N\left (\begin{bmatrix} 0 \\ 0 \end{bmatrix} , \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}\right ).$$
The Gaussian distribution has a mean of zero and variances of 1; $\rho$ measures the correlation between the two dimensions. Here we use $\rho = 0.9$. The probability density can be written as
$$p^\ast(\boldsymbol \theta) \propto \text{exp}\left (-\frac{1}{2} \boldsymbol \theta^\top \mathbf \Sigma^{-1} \boldsymbol \theta\right ),$$
where $\mathbf \Sigma = \begin{bmatrix} 1 & 0.9 \\ 0.9 & 1 \end{bmatrix}.$ The target distribution's contour plot is shown below.
begin
ρ = 0.9
μ = [0, 0]
Σ = [1.0 ρ; ρ 1]
target_p = (x) -> logpdf(MvNormal(μ, Σ), x)
plt_contour = plot(-5:0.1:5, -5:0.1:5, (x, y)-> exp(target_p([x,y])),legend = :none, st=:contour, linewidth=0.5, la=0.5, levels=5, xlabel=L"\theta_1", ylabel=L"\theta_2", title="Contour plot of a highly correlated bivariate Gaussian")
end
To apply an MH algorithm, we adopt a simple uncorrelated bivariate Gaussian as the proposal distribution:
$$q(\theta'|\theta^{(t)}) = \mathcal N\left (\theta^{(t)}, \sigma^2_q \times \begin{bmatrix} 1 & 0 \\ 0 & 1\end{bmatrix}\right )$$
the proposal distribution proposes a Gaussian random walk conditional on the current state $\theta^{(t)}$.
$\sigma^2_q > 0$ is a tuning parameter to set.
note that since the proposal distribution is symmetric, i.e.
$$q(\theta'|\theta^{(t)}) = q(\theta^{(t)}|\theta')$$
the acceptance probability simplifies to
$$a = \text{min}\left \{\frac{p^\ast(\theta')}{p^\ast(\theta^{(t)})}, 1\right \}$$
The Metropolis-Hastings algorithm with the mentioned proposal distribution can be implemented in Julia as follows:
# Metropolis Hastings with isotropic Gaussian proposal
- `ℓπ`: the target log probability density
- `mc`: number of Monte Carlo samples to draw
- `dim`: the dimension of the target space
- `Σ`: proposal's covariance matrix
- `x0`: initial starting state
function metropolis_hastings(ℓπ, mc=500; dim=2, Σ = 10. * Matrix(I, dim, dim), x0=zeros(dim))
samples = zeros(dim, mc)
samples[:, 1] = x0
L = cholesky(Σ).L
ℓx0 = ℓπ(x0)
accepted = 0
for i in 2:mc
# randomly draw from the proposal distribution
# faster than xstar = rand(MvNormal(x0, Σ))
xstar = x0 + L * randn(dim)
ℓxstar = ℓπ(xstar)
# calculate the acceptance ratio `a`
# note ℓπ is in log scale, need to apply exp
a = exp(ℓxstar - ℓx0)
# if accepted
if rand() < a
x0 = xstar
ℓx0 = ℓxstar
accepted += 1
end
# store the sample
@inbounds samples[:, i] = x0
end
accpt_rate = accepted / (mc-1)
return samples, accpt_rate
end
To use the algorithm to draw samples from the target Gaussian distribution, we simply feed in the required inputs, e.g.
# set the target posterior distribution to sample
target_ℓπ = (x) -> logpdf(MvNormal(μ, Σ), x)
# set proposal distribution's variance
σ²q = 1.0
mh_samples, rate = metropolis_hastings(target_ℓπ, 2000; Σ = σ²q * Matrix(I, 2, 2), x0=[-2.5, 2.5])
An animation is listed below to demonstrate the MH algorithm with a proposal distribution $\sigma^2_q = 1.0$, where
the ${\color{red}\text{red dots}}$ are the rejected proposals
the ${\color{green}\text{green dots}}$ are the MCMC samples (either accepted new proposals or stay-put old samples)
begin
anim_σ_1 = demo_mh_gif(target_p; qΣ = 1.0 * Matrix(I,2,2), x₀ = [-2, 2])
gif(anim_σ_1, fps=5)
end
It can be observed that
the acceptance rate is about 0.32
and the chain is mixing well, which means the high-density area of the target distribution has been explored well.
2000 samples drawn from the MH algorithm after 2000 burn-in are shown below.
begin
Random.seed!(100)
σ²q = 1.0
mh_samples = metropolis_hastings(target_p, 4000; Σ = σ²q * Matrix(I,2,2), x0=[-2, 2])[1]
end;
begin
plt = covellipse(μ, Σ,
n_std=1.64, # 5% - 95% quantiles
xlims=(-4, 4), ylims=(-4, 4),
alpha=0.3,
c=:steelblue,
label="90% HPD",
xlabel=L"\theta_1", ylabel=L"\theta_2")
scatter!(plt, mh_samples[1, 2001:end], mh_samples[2, 2001:end], label="MH samples", mc=:red, ma=0.3)
end
The traditional Monte Carlo method ideally requires samples to be independent. MCMC samples, however, are simulated as a Markov chain: the next sample depends on the current state, so MCMC samples are all dependent.
It is worth noting that Monte Carlo estimation is still valid even when the samples are dependent as long as the chain reaches equilibrium. However, an estimation's standard error, in general, deteriorates when the samples in use are more dependent.
There are two commonly used practices to reduce the temporal correlations of an MCMC chain: thinning and parallelism. And they can be used together to further improve the sample quality.
Since the dependence decreases as the lag increases, one can retain MCMC samples only at some fixed lag. For example, after convergence, save every 10th sample.
Since each MCMC chain simulates independently, we can run several chains in parallel to make full use of modern computers' concurrent processing capabilities.
Why should we run parallel chains rather than a single long chain? Ideally, we want the samples to be independent. Samples from a single chain are temporally dependent, whereas multiple chains explore the posterior landscape independently; mixed together, they produce less dependent samples.
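Both practices amount to simple array indexing. The sketch below is illustrative (a random array stands in for real MCMC output; the shapes match the four-chain example that follows):

```julia
# Stand-in for four parallel 2-D chains of 4,000 draws each
# (dimensions: iteration × parameter × chain).
chains = randn(4_000, 2, 4)

burnin, lag = 2_000, 10
kept = chains[(burnin + 1):lag:end, :, :]             # discard burn-in, then thin at lag 10
pooled = reshape(permutedims(kept, (1, 3, 2)), :, 2)  # mix the four chains together
size(kept), size(pooled)                              # ((200, 2, 4), (800, 2))
```

Thinning at lag 10 keeps 200 of the 2,000 post-burn-in draws per chain; pooling the four chains yields 800 retained samples.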
The following animation shows the evolution of four parallel Metropolis-Hastings chains. To make the visualisation cleaner, rejected proposals are not shown in the animation. The four chains start at four initial locations (i.e. the four corners). Note that all four chains quickly move to the high-density regions regardless of their starting locations, which is a key property of MCMC.
begin
Random.seed!(100)
# simulate 4000 MCMC steps for each chain
mc_iters = 4000
# pre-allocate storage for the chains
mh_samples_parallel = zeros(mc_iters, 2, 4)
# four starting locations
x0s = [[-1, -1], [-1, 1], [1,1], [1, -1]] .* 3.5
# simulate 4 chains in parallel
Threads.@threads for t in 1:4
mh_samples_parallel[:, :, t] = metropolis_hastings(target_p, mc_iters; Σ = σ²q * Matrix(I, 2, 2), x0=x0s[t])[1]'
end
end
let
# create a plot of the high probability density for the target distribution
plt_anim = covellipse(μ, Σ,
n_std=1.64, # 5% - 95% quantiles
xlims=(-4, 4), ylims=(-4, 4),
alpha=0.3,
c=:steelblue,
legend = :outerright,
label="90% HPD",
xlabel=L"\theta_1", ylabel=L"\theta_2", title="Animation of 4 Parallel MH Chains")
labels = "chain " .* string.(1:4)
# create the animation for the first 100 steps
mh_anim_parallel = @animate for i in 1:99
for ch in 1:4
scatter!(plt_anim, (mh_samples_parallel[i, 1, ch], mh_samples_parallel[i, 2, ch]),
label= i == 1 ? labels[ch] : false, mc=ch, ma=0.5)
plot!(plt_anim, mh_samples_parallel[i:i+1, 1, ch], mh_samples_parallel[i:i+1, 2, ch], st=:path, lc=ch, la=0.5, label=false)
end
end
# show the animation at 10 frames per second
gif(mh_anim_parallel, fps= 10)
end
The MH algorithm's performance depends on the quality of the proposal distribution, and setting a suitable one is not easy, especially when the target distribution is high-dimensional. As a rule of thumb, we should aim for an acceptance rate between 0.2 and 0.6; too high or too low signals that the chain is struggling.
Acceptance rate too high
This usually happens when $\sigma^2_q$ is too small: proposals all stay close to each other in one high-density area $\Rightarrow$ most of the proposals are accepted, but as a result $\Rightarrow$ other high-density areas are not explored sufficiently.
The following animation illustrates the idea: the proposal's variance is set as $\sigma_q^2 = 0.005$; despite a high acceptance rate of 0.924, the chain only makes tiny moves and therefore does not explore the target distribution well enough.
begin
anim_σ_001 = demo_mh_gif(target_p; qΣ = 0.005 * Matrix(I,2,2), x₀ = [-2, -2], mc = 500)
gif(anim_σ_001, fps=5)
end
Acceptance rate too low
This usually happens when $\sigma^2_q$ is too large: the proposals jump further but are more likely to land in undesirable low-density locations $\Rightarrow$ a lot of rejections $\Rightarrow$ highly temporally correlated samples.
The chain shown below employs a more audacious proposal distribution with $\sigma_q^2 = 10.0$. As a result, most of the proposals are rejected which leads to an inefficient sampler (most of the samples are the same).
begin
anim_σ_15 = demo_mh_gif(target_p; qΣ = 10 * Matrix(I,2,2), x₀ = [-2, -2], mc = 500)
gif(anim_σ_15, fps=5)
end
To avoid the difficulty of hand-tuning proposal distributions, more advanced variants of the MH algorithm, such as Hamiltonian Monte Carlo (HMC) and the No-U-Turn sampler (NUTS), have been proposed. Instead of exploring the target distribution by a random walk, HMC and its variants employ the gradient of the target distribution to help explore the landscape.
MCMCChains.jl
Julia has a wide range of packages to analyse MCMC chains. In particular, the MCMCChains.jl package provides tools to visualise and summarise MCMC chains.
We can construct Chains object using MCMCChains.jl by passing a matrix of raw chain values along with optional parameter names:
The first 3 samples of each raw chain:
mh_samples_parallel[1:3, :, :]
3×2×4 Array{Float64, 3}:
[:, :, 1] =
-3.5 -3.5
-3.5 -3.5
-3.5 -3.5
[:, :, 2] =
-3.5 3.5
-3.5 3.5
-3.61485 2.23203
[:, :, 3] =
3.5 3.5
3.5 3.5
3.47994 2.45702
[:, :, 4] =
3.5 -3.5
2.83202 -2.32634
2.83202 -2.32634
Create a Chains object by passing the raw sample matrix and names of the parameters (optional)
# use MCMCChains.jl to create a Chains object
# the first 2000 samples are discarded as burn-in
# Note that to apply thinning=2, one can use ``mh_samples_parallel[2001:2:end, :, :]``
chain_mh = Chains(mh_samples_parallel[2001:end, :, :], [:θ₁, :θ₂])
One can visualise Chains objects by
traceplot(chain_mh): plots the chain samples at each iteration of the Markov Chain, i.e. time series plot
density(chain_mh): plots kernel density estimations fit on the samples
or plot(chain_mh): plots both trace and density plots
For example, the following plots show the four parallel MH chains for the bi-variate Gaussian example
the left two plots show the trace plots of four chains
the right two plots show densities fitted to the four chains
It can be seen visually that all four chains converge to roughly the same equilibrium distributions.
begin
chain_mh = Chains(mh_samples_parallel[2001:end, :, :], [:θ₁, :θ₂])
plot(chain_mh)
end
Other visualisation methods include
meanplot(c::Chains): running average of the chain
histogram(c::Chains): construct histogram of the chain
autocorplot(c::Chains): auto correlations of the chain
corner(c::Chains): make a scatter plot of the chains; a tool to view correlations between dimensions of the samples
For example, the following plot uses corner() to show a scatter plot of the chain.
begin
corner(chain_mh)
end
MCMCChains.jl also provides easy-to-use interfaces to tabulate important statistics of MCMC chain instances:
summarystats: aggregates statistics such as the mean, standard deviation, and effective sample size (ess), etc.
describe: in addition to the above, it shows the 2.5% to 97.5% percentiles of the samples
# summary of the chain
summarystats(chain_mh)
describe(chain_mh)
summarystats(chain_mh)
| | parameters | mean | std | naive_se | mcse | ess | rhat |
|---|---|---|---|---|---|---|---|
| 1 | :θ₁ | 0.109403 | 0.991057 | 0.0110803 | 0.0477931 | 363.28 | 1.00179 |
| 2 | :θ₂ | 0.119246 | 0.995301 | 0.0111278 | 0.0482049 | 368.43 | 1.00163 |
describe(chain_mh)
2-element Vector{ChainDataFrame}:
Summary Statistics (2 x 7)
Quantiles (2 x 6)
Markov chain theory only provides us with a theoretical guarantee:
if one simulates an MCMC chain long enough, eventually the sample empirical distribution will converge to the target posterior distribution.
Unfortunately, this is only a theoretical guarantee. We do not have the time or resources to run a chain indefinitely. In practice, we simulate chains of fixed lengths and inspect them to check convergence.
Luckily, there are several easy-to-compute metrics to diagnose a chain. Most of the techniques apply stationary time series models to assess the performance of a chain. The most commonly used metrics are the $\hat R$ statistic and the effective sample size (ess).
R̂ statistic (rhat)
R̂ is a metric that measures the stability of a chain. Upon convergence, any chunks of a chain, e.g. the first and last halves of the chain, or parallel chains, should be similar to each other. The $\hat R$ statistic compares the between-chain variability with the within-chain variability, and should be close to one if all chains have converged.
As a rule of thumb, a valid converging chain's $\hat R$ statistic should be less than 1.01.
Effective sample size (ess)
The traditional Monte Carlo method requires samples to be independent. However, MCMC samples are all dependent, as the chain is simulated in a Markovian fashion: the next sample depends on the previous state. Monte Carlo estimation remains valid even with dependent samples, as long as the chain has mixed well; the standard error of the estimation, however, deteriorates as the samples become more dependent.
ess roughly estimates how many equivalent independent samples are in a dependent chain. Since samples in an MCMC chain are correlated, they contain less information than truly independent samples. ess therefore discounts the sample size by some factor that depends on the temporal correlation between the iterations.
We can also measure the efficiency of an MCMC algorithm by calculating
$$\text{efficiency} = \frac{\text{ess}}{\text{iterations}},$$
which measures the information content of a chain: a highly efficient algorithm needs fewer iterations to achieve the same estimation accuracy.
A larger effective sample size (ess) implies the MCMC algorithm is more efficient.
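A crude ESS estimator can be sketched to build intuition. This is a simplified illustration (truncating the autocorrelation sum at the first non-positive term), not the estimator MCMCChains.jl actually uses:

```julia
using Statistics, Random

# Rough ESS: R / (1 + 2 Σₖ ρₖ), where ρₖ is the lag-k autocorrelation of the chain.
function ess_rough(x)
    R, μ, σ² = length(x), mean(x), var(x)
    τ = 1.0
    for k in 1:(R - 2)
        ρ = mean((x[1:(R - k)] .- μ) .* (x[(1 + k):R] .- μ)) / σ²
        ρ <= 0 && break          # truncate at the first non-positive autocorrelation
        τ += 2ρ
    end
    return R / τ
end

Random.seed!(7)
iid = randn(10_000)                                  # independent draws: ESS ≈ R
ar = accumulate((y, ε) -> 0.9y + ε, randn(10_000))   # AR(1): heavily autocorrelated
efficiency_iid = ess_rough(iid) / 10_000
efficiency_ar = ess_rough(ar) / 10_000               # far smaller: fewer "effective" samples
```

For the AR(1) chain with coefficient 0.9, the efficiency lands near the theoretical $(1-0.9)/(1+0.9) \approx 0.05$, mirroring the sticky MH chains above.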
Examples
To demonstrate the ideas, we compare the ess and R̂ of three MH chains with different proposal variances: $\sigma^2_q=$ 1.0, 0.005 and 20. Remember, $\sigma^2_q=1.0$ performs the best; 0.005 and 20 are less optimal.
Summary statistics with MH chain of $\sigma^2_q=1.0$
summarystats(chain_mh)
| | parameters | mean | std | naive_se | mcse | ess | rhat |
|---|---|---|---|---|---|---|---|
| 1 | :θ₁ | 0.109403 | 0.991057 | 0.0110803 | 0.0477931 | 363.28 | 1.00179 |
| 2 | :θ₂ | 0.119246 | 0.995301 | 0.0111278 | 0.0482049 | 368.43 | 1.00163 |
begin
# simulate two chains with two different proposal variances
Random.seed!(100)
σ²q_too_small = 0.005
σ²q_too_big = 20.0
mh_samples_too_small = zeros(mc_iters, 2, 4)
mh_samples_too_big = zeros(mc_iters, 2, 4)
mh_samples_ideal = zeros(mc_iters, 2, 4)
Threads.@threads for t in 1:4
mh_samples_too_small[:, :, t] = metropolis_hastings(target_p, mc_iters; Σ = σ²q_too_small * Matrix(I, 2, 2), x0=x0s[t])[1]'
mh_samples_too_big[:, :, t] = metropolis_hastings(target_p, mc_iters; Σ = σ²q_too_big * Matrix(I, 2, 2), x0=x0s[t])[1]'
mh_samples_ideal[:, :, t] = metropolis_hastings(target_p, mc_iters; Σ = Σ, x0=x0s[t])[1]'
end
chain_too_small = Chains(mh_samples_too_small[2001:end, :, :], [:θ₁, :θ₂])
chain_too_big = Chains(mh_samples_too_big[2001:end, :, :], [:θ₁, :θ₂])
chain_ideal = Chains(mh_samples_ideal[2001:end, :, :], [:θ₁, :θ₂])
end;
Summary statistics with MH chain of $\sigma^2_q=.005$
summarystats(chain_too_small)
| | parameters | mean | std | naive_se | mcse | ess | rhat |
|---|---|---|---|---|---|---|---|
| 1 | :θ₁ | 0.351195 | 0.941285 | 0.0105239 | 0.097075 | 20.0356 | 1.88365 |
| 2 | :θ₂ | 0.398657 | 0.940385 | 0.0105138 | 0.0970787 | 19.4734 | 1.95291 |
Summary statistics with MH chain of $\sigma^2_q=20.0$
summarystats(chain_too_big)
| | parameters | mean | std | naive_se | mcse | ess | rhat |
|---|---|---|---|---|---|---|---|
| 1 | :θ₁ | 0.156851 | 0.971666 | 0.0108636 | 0.0669214 | 134.299 | 1.02462 |
| 2 | :θ₂ | 0.145832 | 0.983643 | 0.0109975 | 0.0687366 | 134.955 | 1.0273 |
It can be observed that
ess: the ess values are around 360, 20, and 134 respectively for the three algorithms, which implies efficiencies of 0.18, 0.01, and 0.067. The second algorithm, with its small proposal variance, achieves a high acceptance rate but is the least efficient.
rhat: both the second and third algorithms have rhat > 1.01, which indicates they do not mix well.
Simple trace plots reveal a lot of information about a chain. One can diagnose a chain by visual inspection.
What do bad chains look like? The following two trace plots show the chain traces of the two less desirable MH algorithms for the bivariate Gaussian example: one with $\sigma^2_q=0.005$ and the other with $\sigma^2_q=20$.
Recall that when $\sigma^2_q$ is too small, the chain proposes small changes at each iteration. As a result, it does not explore the HPD region sufficiently well. If one splits the chain into two halves, the two chunks (with green and red backgrounds) exhibit drastically different values.
begin
plot(1:1000, Array(chain_too_small[1:1000,1:1, 1:1]), title="A bad chain; proposal variance " * L"\sigma^2_q = 0.005", label="", color=1, xlabel="Iteration", ylabel="Sample value", size=(600,300))
vspan!([0,1000], color = :green, alpha = 0.2, labels = "");
plot!(1001:2000, Array(chain_too_small[1001:end, 1:1, 1:1]), label="", color=1)
vspan!([1001,2000], color = :red, alpha = 0.2, labels = "");
end
On the other hand, when $\sigma^2_q$ is too large, the chain attempts large jumps back and forth, resulting in most proposals being rejected and old samples being retained. The chain becomes very sticky and contains much less information than independent samples.
plot(Array(chain_too_big[:,1:1, 1:1]), title="A bad chain; proposal variance "*L"\sigma^2_q = 20", xlabel="Iteration", ylabel="Sample value", label="", size=(600,300))
What do good chains look like? The figure below shows trace plots of four different chains with ess = 4.6, 49.7, 90.9, and 155. The top two are bad chains; the bottom two are of relatively better quality. A good chain should show stable statistical properties across the iterations with a reasonable level of variance.
let
ess = zeros(4)
ess[1] = MCMCChains.ess_rhat(chain_too_small[:,1:1,1:1])[:,:ess]
ess[2] = MCMCChains.ess_rhat(chain_too_big[:,1:1,1:1])[:,:ess]
ess[3] = MCMCChains.ess_rhat(chain_mh[:,1:1,1:1])[:,:ess]
ess[4] = MCMCChains.ess_rhat(chain_ideal[:,1:1,1:1])[:,:ess]
ess= round.(ess; digits=1)
p1 = plot(chain_too_small[:, 1, 1], title="ess=$(ess[1])", label="")
p2 = plot(chain_too_big[:,1,1], title="ess=$(ess[2])", label="")
p3 = plot(chain_mh[:,1,1], title="ess=$(ess[3])", label="")
p4 = plot(chain_ideal[1:end, 1, 1], title="ess=$(ess[4])", label="")
Plots.plot(p1, p2, p3, p4)
end
Metropolis-Hastings is considered the mother of most modern MCMC algorithms. Many other MCMC variants can be viewed as special cases of the MH algorithm. We will quickly introduce two other popular variants, Gibbs sampling and the Hamiltonian sampler, focusing on the intuition behind them.
Gibbs sampling reduces a multivariate sampling problem to a series of univariate samplings. Assume the target distribution is bivariate with density $p(\theta_1, \theta_2)$ and the chain is currently at state $(\theta_1^{(r)}, \theta_2^{(r)})$. Gibbs sampling alternately samples from the full conditionals in two steps:
sample $\theta_1^{(r+1)} \sim p(\theta_1|\theta_2^{(r)})$
sample $\theta_2^{(r+1)} \sim p(\theta_2|\theta_1^{(r+1)})$
The pair $(\theta_1^{(r+1)}, \theta_2^{(r+1)})$ is retained as the new sample.
The following diagram demonstrates how Gibbs sampling explores the bivariate Gaussian distribution. Note the zig-zag behaviour: at each step, Gibbs sampling only changes one dimension of the sample.
begin
Random.seed!(100)
# run Gibbs sampling for the bivariate Gaussian example
# check the appendix for the details of the implementation
# 4000 iterations in total and starting location at x0
samples_gibbs = gibbs_sampling(4000; x0=[-2.5, 2.5])
end;
let
plt_gibbs = covellipse(μ, Σ,
n_std=1.64, # 5% - 95% quantiles
xlims=(-4, 4), ylims=(-4, 4),
alpha=0.3,
c=:steelblue,
label="90% HPD",
xlabel=L"\theta_1", ylabel=L"\theta_2")
# create the animation
gibbs_anim = @animate for i in 1:200
scatter!(plt_gibbs, (samples_gibbs[1, i], samples_gibbs[2, i]),
label=false, mc=:red, ma=0.5)
plot!(samples_gibbs[1, i:i + 1], samples_gibbs[2, i:i + 1], seriestype=:path,
lc=:green, la=0.5, label=false)
end
gif(gibbs_anim, fps=10)
end
Multivariate Gibbs sampling The general Gibbs sampling algorithm for a multivariate problem is listed below. The idea is the same: at each iteration, a series of small steps based on the full conditionals are made and chained together.
Initialise $\boldsymbol \theta^{(0)}=[\theta_1^{(0)},\theta_2^{(0)}, \ldots, \theta_D^{(0)} ]$ arbitrarily
For $r = 1,2,\ldots$:
sample dimension $1$:
$$\theta_1^{(r)} \sim p(\theta_1|\theta_2^{(r-1)}, \ldots, \theta_D^{(r-1)})$$
sample dimension $2$:
$$\theta_2^{(r)} \sim p(\theta_2|\theta_1^{(r)}, \theta_3^{(r-1)}, \ldots, \theta_D^{(r-1)})$$
$$\vdots$$
sample dimension $D$:
$$\theta_D^{(r)} \sim p(\theta_D|\theta_1^{(r)}, \ldots, \theta_{D-1}^{(r)})$$
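The sweep above can be sketched generically. Here `conditionals[d]` is assumed to be a user-supplied function that draws dimension $d$ from its full conditional given the current state (the names `gibbs` and `conditionals` are illustrative):

```julia
# Generic Gibbs sweep: update one coordinate at a time from its full
# conditional, always conditioning on the freshest values of the others.
function gibbs(conditionals, θ0, iters)
    D = length(θ0)
    samples = zeros(D, iters)
    θ = copy(θ0)
    for r in 1:iters
        for d in 1:D
            θ[d] = conditionals[d](θ)   # draw θ_d | θ_{-d}
        end
        samples[:, r] = θ
    end
    return samples
end
```

For the bivariate Gaussian example, `conditionals` would hold the two Normal full-conditional draws used by `gibbs_sampling` in the appendix.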
Gibbs sampling is a Metropolis-Hastings algorithm
One drawback of the MH sampler is that the user needs to specify a proposal distribution. Gibbs sampling lifts this burden from the user by using the full conditionals of the target distribution as the proposal. Gibbs sampling, therefore, is a special case of the MH algorithm. One can also show that when the full conditionals are used, the acceptance probability is always 100%; there is, therefore, no rejection step in Gibbs sampling.
Foldable("Further details on Gibbs sampling is a MH algorithm.", md"For simplicity, we only consider bivariate case. Assume the chain is currently at state ``\boldsymbol \theta^{(r)}=(\theta_1^{(r)}, \theta_2^{(r)})``, the proposal distribution proposes to move to a new state ``\boldsymbol \theta' = (\theta_1', \theta_2')`` with a proposal density
$$q(\boldsymbol \theta'|\boldsymbol \theta^{(r)}) = p(\theta_1'|\theta_2') \cdot \mathbf 1({\theta_2'= \theta_2^{(r)})},$$
where ``\mathbf 1(\cdot)`` returns 1 when the testing condition is true and 0 otherwise. The transition kernel basically states ``\theta_2^{(r)}`` is not changed.
The acceptance probability then is
$$a \triangleq \frac{p(\boldsymbol \theta') q(\boldsymbol \theta^{(r)}|\boldsymbol \theta')}{p(\boldsymbol \theta^{(r)})q(\boldsymbol \theta'|\boldsymbol \theta^{(r)})} = \frac{p(\theta_1', \theta_2') \times p(\theta_1^{(r)}|\theta_2')\cdot \mathbf 1({\theta_2'= \theta_2^{(r)})}}{p(\theta_1^{(r)}, \theta_2^{(r)})\times p(\theta_1'|\theta_2')\cdot \mathbf 1({\theta_2'= \theta_2^{(r)})}}.$$
When ``\theta_2' = \theta_2^{(r)}``,
$$a = \frac{p(\theta_1', \theta_2^{(r)}) \times p(\theta_1^{(r)}|\theta_2^{(r)})}{p(\theta_1^{(r)}, \theta_2^{(r)})\times p(\theta_1'|\theta_2^{(r)})}= \frac{p(\theta_1', \theta_2^{(r)}) \times \frac{p(\theta_1^{(r)}, \theta_2^{(r)})}{p(\theta_2^{(r)})}}{p(\theta_1^{(r)}, \theta_2^{(r)})\times \frac{p(\theta_1',\theta_2^{(r)})}{p(\theta_2^{(r)})}} = \frac{p(\theta_2^{(r)})}{p(\theta_2^{(r)})} =1.0,$$
when ``\theta_2'\neq \theta_2^{(r)}``, the proposal density is zero, so such a move is never proposed and the ratio ``a=\tfrac{0}{0}`` never needs to be evaluated. Therefore, there is no need to test the proposal; the proposed state is always accepted.
")
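We can also check this numerically. The sketch below evaluates the MH acceptance ratio for a Gibbs move on a standard bivariate Gaussian with correlation $\rho$ (the densities are hand-rolled and the test state is arbitrary):

```julia
# Hand-rolled densities; ϕ is the univariate Gaussian pdf.
ϕ(x, μ, σ) = exp(-0.5 * ((x - μ) / σ)^2) / (σ * sqrt(2π))

ρ = 0.9
σc = sqrt(1 - ρ^2)
# standard bivariate Gaussian with correlation ρ, factorised as
# p(θ₁, θ₂) = p(θ₁ | θ₂) p(θ₂)
p(θ) = ϕ(θ[1], ρ * θ[2], σc) * ϕ(θ[2], 0.0, 1.0)
q(θ₁, θ₂) = ϕ(θ₁, ρ * θ₂, σc)         # full conditional p(θ₁ | θ₂)

θ  = [-2.0, 1.5]                       # current (arbitrary) state
θ′ = [ρ * θ[2] + σc * randn(), θ[2]]   # Gibbs proposal: resample θ₁ only

# MH acceptance ratio a = p(θ′) q(θ₁ | θ₂′) / (p(θ) q(θ₁′ | θ₂))
a = p(θ′) * q(θ[1], θ′[2]) / (p(θ) * q(θ′[1], θ[2]))
# a equals one (up to floating point) whatever the current state is
```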
We have already shown that a high acceptance rate does not necessarily imply high efficiency. The same applies to Gibbs sampling. The zig-zag exploration scheme of Gibbs sampling can be slow for highly correlated sample spaces. For example, in our bivariate Gaussian example, Gibbs sampling achieves around ess = 129 (out of 2000 samples), an efficiency of around 0.0645.
Summary statistics of Gibbs sampling
begin
gibbs_chain = Chains(samples_gibbs[:, 2001:end]', [:θ₁, :θ₂])
summarystats(gibbs_chain)
end
| | parameters | mean | std | naive_se | mcse | ess | rhat |
|---|---|---|---|---|---|---|---|
| 1 | :θ₁ | 0.0802593 | 0.963368 | 0.0215416 | 0.0764537 | 130.184 | 1.03633 |
| 2 | :θ₂ | 0.0918141 | 0.953711 | 0.0213256 | 0.0774604 | 127.312 | 1.0346 |
We have discussed the limitation of the Metropolis-Hastings algorithm: the proposal distribution is hard to tune. Ideally, we want a proposal distribution with the following properties:
propose the next state as far from the current state as possible, so that successive samples are close to independent
at the same time, the proposed state should be of good quality,
ideally as good (of as high probability density) as the current one, leading to an acceptance rate $a$ close to one
Simple proposals, such as Gaussians, cannot achieve both of the desired properties. The problem lies in their random-walk exploration: the proposals blindly propose the next step and ignore the local geometry of the target distribution. Gradients, which always point in the direction of steepest ascent, provide exactly the geometric information that can guide a proposal to reach a distant yet high-probability region.
To understand the idea, let's consider a more complicated target distribution formed by superimposing three bivariate Gaussians. The three Gaussians have means $[-4,0], [4,0],$ and $[0,0]$, and covariance matrices $\begin{bmatrix} 2 & 1.5 \\ 1.5& 2\end{bmatrix}$, $\begin{bmatrix} 2 & -1.5 \\ -1.5 & 2\end{bmatrix}$ and $\begin{bmatrix} 2 & 0 \\ 0 & 2\end{bmatrix}$ respectively.
The contour and surface plots of the target distribution are shown below for reference.
begin
d1 = MvNormal([-4, 0], [2 1.5; 1.5 2])
d2 = MvNormal([4, 0], [2 -1.5; -1.5 2])
d3 = MvNormal([0, 0], [2 0; 0. 2])
d = MixtureModel([d1, d2, d3])
end;
begin
ℓ(x) = logpdf(d, x)
∇ℓ(x) = ForwardDiff.gradient(ℓ, x)
end;
begin
con_p = contour(-9:0.1:9, -6:0.1:6, (x,y) -> pdf(d, [x,y]), legend=false, ratio=1, xlim =[-9,9], ylim=[-6,6])
surf_p = surface(-9:0.1:9, -6:0.1:6, (x,y) -> pdf(d, [x,y]), legend=false)
Plots.plot(con_p, surf_p, layout=(1,2))
end
A vector field view of the gradient of the target density is shown below. Note that at each location, the gradient (a blue vector) points in the direction of steepest ascent. The gradient therefore provides key information about the local landscape. The Hamiltonian sampler proposes the next state by making use of this gradient information.
begin
meshgrid(x, y) = (repeat(x, outer=length(y)), repeat(y, inner=length(x))) # helper function to create a quiver grid.
∇ℓ_(x, y) = ∇ℓ([x, y])/5
contour(-9:0.1:9, -6:0.1:6, (x,y) -> pdf(d, [x,y]), legend=false, title="Contour plot and gradient plot of the target density")
xs,ys = meshgrid(range(-9, stop=9, length=15), range(-6,stop=6,length=11))
quiver!(xs, ys, quiver=∇ℓ_, c=:blue)
end
Foldable("Julia code for the target distribution and visualisation.", md"By using Julia's `Distributions.jl` and `ForwardDiff.jl`, we can formulate the density and its corresponding gradient function as follows:
```julia
d1 = MvNormal([-4, 0], [2 1.5; 1.5 2])
d2 = MvNormal([4, 0], [2 -1.5; -1.5 2])
d3 = MvNormal([0, 0], [2 0; 0. 2])
# Superimpose the three Gaussians together to form a new density
d = MixtureModel([d1, d2, d3])
# log likelihood of the mixture distribution
ℓ(x) = logpdf(d, x)
# gradient of the log pdf
∇ℓ(x) = ForwardDiff.gradient(ℓ, x)
```
The contour and surface of the target distribution can be plotted easily via
```julia
contour(-9:0.1:9, -6:0.1:6, (x,y) -> pdf(d, [x,y]), legend=false, ratio=1, xlim =[-9,9], ylim=[-6,6])
surface(-9:0.1:9, -6:0.1:6, (x,y) -> pdf(d, [x,y]), legend=false)
```
")
A very useful metaphor for understanding the Hamiltonian sampler is to imagine a hockey puck sliding over the surface of the negative target density without friction. The surface is formed by negating the target density (i.e. flipping the surface plot, so intuitively a mountain becomes a valley).
surface(-9:0.1:9, -6:0.1:6, (x,y) -> -pdf(d, [x,y]), legend=false, title="Negative of the target density")
At each iteration, a random initial force (usually drawn from a Gaussian) is applied to the puck; the puck then slides across the landscape to reach the next state.
A numerical algorithm called Leapfrog integration is usually used to simulate the movement of the puck. In particular, the trajectory is simulated as a sequence of $T$ steps of length $\epsilon$. The two tuning parameters are:
$\epsilon$: each Leapfrog's step length
T: total number of steps taken to form the whole trajectory
Therefore, the total length of the puck's movement is proportional to $\epsilon \times T$.
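The Leapfrog update itself is only a few lines (the name `leapfrog` is illustrative; the appendix's `hamiltonian_proposal_step` chains $T$ of these steps together). Its key property is that the total energy $H(x, p) = E(x) + \tfrac{1}{2}p^\top p$ is approximately conserved along the trajectory:

```julia
# One Leapfrog step of size ϵ for Hamiltonian dynamics with potential
# energy E(x) = -ℓ(x); ∇E is the gradient of E.
function leapfrog(x, p, ∇E, ϵ)
    p -= ϵ / 2 * ∇E(x)   # half step in momentum
    x += ϵ * p           # full step in position
    p -= ϵ / 2 * ∇E(x)   # half step in momentum
    return x, p
end
```

For a quadratic bowl $E(x) = x^2/2$, the simulated puck oscillates while $H$ stays nearly constant; this near-conservation is why HMC proposals that travel far can still be accepted with high probability.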
The animation below demonstrates the idea, where 10 independent HMC proposals are simulated by the Leapfrog algorithm from an initial starting location at $[0, 2.5]$:
10 random forces of different directions and strengths are applied to the puck
the puck follows the laws of physics to explore the landscape (note how the pucks move along the curvature of the valley)
begin
T_step = 50
traj_n = 10
x0 = [0, 2.5]
hmc_trajs = zeros(2, T_step+1, traj_n)
Random.seed!(189)
for t in 1:traj_n
hmc_trajs[:, :, t] = hamiltonian_proposal_step(ℓ, ∇ℓ, x0, 0.1, T_step)
end
end
let
hmc_anim_plt = plot(-9:0.1:9, -6:0.1:6, (x,y) -> -pdf(d, [x,y]), st=:contour, legend=false, title="HMC's proposal animation", xlabel=L"\theta_1", ylabel=L"\theta_2")
scatter!(hmc_anim_plt, [x0[1]], [x0[2]], color= traj_n+1)
ts = 1:traj_n
hmc_anim = @animate for i in 1:2:T_step
for t in ts
plot!(hmc_trajs[1, i:i + 1, t], hmc_trajs[2, i:i + 1, t], seriestype=:path,
lc=t, la=1.0, lw=3.0, label="")
end
end
gif(hmc_anim, fps=2)
end
let
hmc_plt = plot(-9:0.1:9, -6:0.1:6, (x,y) -> -pdf(d, [x,y]), st=:contour, legend=false, title="HMC's proposals end snapshot", xlabel=L"\theta_1", ylabel=L"\theta_2")
scatter!([x0[1]], [x0[2]], color= traj_n+1)
for t in 1:traj_n
for i in 1:2:T_step
plot!(hmc_trajs[1, i:i + 1, t], hmc_trajs[2, i:i + 1, t], seriestype=:path,
lc=t, la=1.0, lw=3.0, label="")
end
# scatter!([hmc_trajs[1, end, t]], [hmc_trajs[2, end, t]], color=t, label="HMC $(t)")
x = [x0[1], hmc_trajs[1, end, t]]
y = [x0[2], hmc_trajs[2, end, t]]
plot!(x,y, marker=:circle, arrow=true, arrowsize=1, lw=2, lc=t, mc=t)
end
# plot!(x,y, marker=:circle, arrow=true, arrowsize=0.5)
hmc_plt
end
It is important to observe that
each proposal ends far away from the initial starting location
and the quality of the proposals' end points is good compared with the starting point.
We have only illustrated the intuition behind the HMC method. Keen readers should consult references such as [3] for further details on Hamiltonian MC methods.
In this section, we are going to compare the original Metropolis-Hastings (MH) with the Hamiltonian Monte Carlo sampler (HMC). The mixture density model is reused here.
Firstly, we visually inspect the algorithms' sampling processes. Animations of the MH and HMC algorithms are presented below to give us a feel for how they behave. The MH algorithm uses a proposal variance $\sigma_q^2 = 0.1$; the HMC is simulated with the following tuning parameters:
$\epsilon=0.1$: the step size of each Leapfrog step in the proposal; the larger $\epsilon$ is, the further the proposed trajectory travels
Trange = [10, 50]: the number of Leapfrog steps per proposal, a random number between 10 and 50
begin
anim_mh_mixture = demo_mh_gif(ℓ;xlim= [-9,9], ylim=[-6,6], qΣ= 0.1* Matrix(I,2,2))
anim_hmc_mixture = demo_hmc_gif(ℓ, ∇ℓ; xlim= [-9,9], ylim=[-6,6], ϵ= 0.1, Trange=[10, 50])
end;
gif(anim_mh_mixture; fps= 10 )
gif(anim_hmc_mixture; fps=10)
It can be observed that HMC performs significantly better than the original MH sampler
note that MH has only managed to explore the two left modes in the first 200 iterations
HMC has a much higher acceptance rate
yet it explores the landscape better (it does not suffer the usual drawback of a high acceptance rate seen in the MH sampler)
We can use Julia's MCMCChains.jl to carry out standard chain diagnosis.
traceplot visualisation
We first inspect the chains visually using trace plots. The MH trace plots show high temporal correlation between the samples (which implies the chain has not yet converged), while the HMC traces seem to mix well.
begin
Random.seed!(100)
spl_mh_mix, _, _, _ = metropolis_hastings(ℓ, 4000; Σ = 0.1 * Matrix(I, 2, 2))
chain_mh_mix = Chains(spl_mh_mix[:, 2001:end]', [:θ₁, :θ₂])
spl_hmc_mix, _, _, _ = hmc_sampling(ℓ, ∇ℓ, 4000, [0, 0]; ϵ=0.1, Τrange =[10,50])
chain_hmc_mix = Chains(spl_hmc_mix[:, 2001:end]', [:θ₁, :θ₂])
end;
MH sampler's trace plot: note the high temporal correlations between the iterations
traceplot(chain_mh_mix)
HMC sampler's trace plot: both dimensions have mixed well
traceplot(chain_hmc_mix)
ess and efficiency
Efficiency metrics can also be computed and compared between the two algorithms. For 2000 iterations (after the first 2000 are discarded as burn-in), HMC produces around 1183 effective samples, while the original MH chain contains fewer than 17. HMC is therefore over 70-fold more efficient than the ordinary MH algorithm.
begin
ess_stats = [mean(summarystats(chain_mh_mix)[:, :ess]), mean(summarystats(chain_hmc_mix)[:, :ess])]
efficienty_stats = ess_stats/2000
df = DataFrame(methods = ["MH", "HMC"], ess=ess_stats, efficiency=efficienty_stats)
end
| | methods | ess | efficiency |
|---|---|---|---|
| 1 | "MH" | 16.1911 | 0.00809556 |
| 2 | "HMC" | 1183.27 | 0.591637 |
In this section, we have introduced MCMC, a class of practical computational methods for Bayesian inference. MCMC aims at generating Monte Carlo samples from a target distribution without requiring the unknown normalising constant. All inference questions can then be answered by Monte Carlo estimation, which remains scalable even when the parameter space is high-dimensional.
However, sampling from a general high-dimensional distribution efficiently is not trivial. Traditional MCMC algorithms, such as Metropolis-Hastings and Gibbs sampling, suffer from random-walk behaviour. More advanced algorithms, such as the Hamiltonian sampler, which employs gradient information about the target distribution, usually perform better.
Implementing MCMC algorithms by oneself is clearly not ideal. Luckily, as an end user, one rarely needs to implement an MCMC algorithm. In the next chapter, we will see how probabilistic programming software such as Turing.jl simplifies the process: one usually only needs to specify a Bayesian model and leave the MCMC sampling to the software. Nevertheless, the user still needs to carry out practical convergence diagnostics to check the quality of the samples, which arguably is the most important takeaway for an applied Bayesian inference user.
Some important concepts/terminologies about MCMC are summarised below:
MCMC: aims at generating samples from a non-normalised distribution; the Markov chain samples can then be used for Monte Carlo estimation
warm-up (or burn-in): refers to the initial chunk of MCMC samples that are not representative of the target distribution; the burn-in samples should be discarded
mix well: the chain has explored the target distribution well and reached equilibrium
ess: effective sample size; roughly how many independent samples a dependent chain contains; the larger the better
$\hat R$ statistic: a statistic used to diagnose a chain; as a rule of thumb, a good chain should have an $\hat R$ statistic less than 1.01
1
This is a contrived question to illustrate the idea of MH algorithm. We do not need to run any algorithm to generate samples from a deterministic die. The sample should be all threes.
2
It depends on how $\tfrac{0}{0}$ is defined. We have assumed it is zero or NaN here.
Julia code used for this chapter
demo_mh_gif
begin
"""
demo_mh_gif(ℓπ; qΣ = 1.0 * Matrix(I,2,2), mc = 200, x₀ = zeros(2), xlim = [-4, 4], ylim = [-4, 4])
Produce a gif that demonstrates how Metropolis-Hastings algorithm works
### Input
- `ℓπ` -- log pdf of the target distribution
- `qΣ` -- proposal distribution's variance, should be `dim` × `dim` symmetric and P.D. matrix
- `mc` -- number of iterations to simulate
- `x₀` -- the initial starting point
- `xlim, ylim` -- horizontal and vertical limits of the plot
### Output
- `anim` -- an array of figures
### Examples
```julia
ℓπ = (x) -> logpdf(MvNormal([0,0], [1.0 0.9; 0.9 1.0]), x)
anim = demo_mh_gif(ℓπ; qΣ = 0.005 * Matrix(I,2,2))
gif(anim, fps= 10)
```
"""
function demo_mh_gif(ℓπ; qΣ = 1.0 * Matrix(I,2,2), mc = 200, x₀= zeros(2), xlim=[-4, 4], ylim = [-4, 4])
Random.seed!(100)
chain, proposed, acpt, rate = metropolis_hastings(ℓπ, mc+1; Σ = qΣ, x0 = x₀)
anim = produce_anim(ℓπ, chain, proposed, acpt, rate; mc = mc, xlim=xlim, ylim = ylim)
return anim
end
end
begin
"""
demo_hmc_gif(ℓπ, ∇ℓπ; ϵ = 0.05, mc = 200, x₀= zeros(2), Trange=[100,200], xlim=[-4, 4], ylim = [-4, 4])
Produce a gif that demonstrates how the Hamiltonian Monte Carlo (HMC) algorithm works
### Input
- `ℓπ` -- log pdf of the target distribution
- `∇ℓπ` -- gradient of the log pdf of the target distribution
- `ϵ` -- step size of the HMC's Leap Frog simulation
- `mc` -- number of iterations to simulate
- `x₀` -- the initial starting point
- `Trange` -- a range of possible steps of the Leap Frog algorithm
- `xlim, ylim` -- horizontal and vertical limits of the plot
### Output
- `anim` -- an array of figures
### Examples
```julia
ℓπ = (x) -> logpdf(MvNormal([0,0], [1.0 0.9; 0.9 1.0]), x)
∇ℓπ = (x) -> ForwardDiff.gradient(ℓπ, x)
anim = demo_hmc_gif(ℓπ, ∇ℓπ)
gif(anim, fps= 10)
```
"""
function demo_hmc_gif(ℓπ, ∇ℓπ; ϵ = 0.05, mc = 200, x₀= zeros(2), Trange=[100,200], xlim=[-4, 4], ylim = [-4, 4])
Random.seed!(100)
chain, proposed, acpt, rate = hmc_sampling(ℓπ, ∇ℓπ, mc+1, x₀; ϵ = ϵ, Τrange =Trange)
anim = produce_anim(ℓπ, chain, proposed, acpt, rate; mc = mc, xlim=xlim, ylim = ylim, title="HMC demonstration", with_accpt_rate=false)
return anim
end
end
begin
"""
produce_anim(ℓπ, chain, proposed, acpt, rate; mc = 200, xlim=[-4, 4], ylim = [-4, 4], title="MH demonstration; ")
Produce an animation that demonstrates how a Metropolis-Hastings-like algorithm works
### Input
- `ℓπ` -- log pdf of the target distribution
- `chain` -- the mcmc trace, should be a `dim` × `mc+1` matrix (including initial state)
- `proposed` -- proposal history
- `acpt` -- acceptance history, array of booleans
- `rate` -- acceptance rate
- `mc` -- number of iterations to show
- `xlim, ylim` -- horizontal and vertical limits of the plot
### Output
- `anim` -- an array of figures
### Examples
```julia
ℓπ = (x) -> logpdf(MvNormal([0,0], [1.0 0.9; 0.9 1.0]), x)
chain, proposed, acpt, rate = metropolis_hastings(ℓπ, 201)
anim = produce_anim(ℓπ, chain, proposed, acpt, rate)
gif(anim, fps= 10)
```
"""
function produce_anim(ℓπ, chain, proposed, acpt, rate; mc = 200, xlim=[-4, 4], ylim = [-4, 4], title="MH demonstration; ", with_accpt_rate=true)
if with_accpt_rate
title = title * "acceptance rate=$(rate)"
end
plt_ = plot(range(xlim[1], xlim[2], length= 100), range(ylim[1], ylim[2], length= 100), (x, y)-> exp(ℓπ([x,y])), legend = :none, st=:contour, linewidth=0.5, la=0.5, levels=5, title=title)
anim = @animate for i in 1:mc
scatter!(plt_, (chain[1, i], chain[2, i]),
label=false, mc=:green, ma=0.5)
if !acpt[i]
scatter!(plt_, (proposed[1, i], proposed[2, i]), label=false, mc =:red, ma=0.4)
plot!([chain[1, i], proposed[1, i]], [chain[2, i], proposed[2, i]], st=:path, lc=:red, linestyle=:dot, la=0.5, label=false)
end
plot!(plt_, chain[1, i:i+1], chain[2, i:i+1], st=:path, lc=:green, la=0.5, label=false)
end
return anim
end
end
metropolis_hastings
begin
"""
metropolis_hastings(ℓπ, mc=500; dim=2, Σ = 10. * Matrix(I, dim, dim), x0=zeros(dim))
Sample a target probability distribution by Metropolis-Hastings algorithm with random walk Gaussian proposal
### Input
- `ℓπ` -- log pdf of the target distribution
- `mc` -- number of MC samples to simulate
- `dim` -- dimension of the target distribution
- `Σ` -- proposal distribution's variance, should be `dim` × `dim` symmetric and P.D. matrix
- `x0` -- the initial starting point
### Output
- `samples` -- the `mc` iterations of samples, a `dim` × `mc` array
- `proposed` -- the history of the proposals, only for visualisation purpose
- `acceptance` -- whether the proposals being accepted or not
- `accpt_rate` -- acceptance rate of the chain
### Examples
```julia
ℓπ = (x) -> logpdf(MvNormal([0,0], [1.0 0.9; 0.9 1.0]), x)
metropolis_hastings(ℓπ, 500; Σ = 1.0 * Matrix(I, 2, 2))
```
"""
function metropolis_hastings(ℓπ, mc=500; dim=2, Σ = 10. * Matrix(I, dim, dim), x0=zeros(dim))
samples = zeros(dim, mc)
proposed = zeros(dim, mc-1)
acceptance = Array{Bool, 1}(undef, mc-1)
fill!(acceptance, false)
samples[:, 1] = x0
L = cholesky(Σ).L
ℓx0 = ℓπ(x0)
accepted = 0
for i in 2:mc
# xstar = rand(MvNormal(x0, Σ))
xstar = x0 + L * randn(dim)
proposed[:, i-1] = xstar
ℓxstar = ℓπ(xstar)
r = exp(ℓxstar - ℓx0)
# if accepted
if rand() < r
x0 = xstar
ℓx0 = ℓxstar
accepted += 1
acceptance[i-1] = true
end
@inbounds samples[:, i] = x0
end
accpt_rate = accepted / (mc-1)
return samples, proposed, acceptance, accpt_rate
end
end
gibbs_sampling
begin
"""
gibbs_sampling(mc=500; μ= zeros(2), ρ=0.9, σ₁² = 1.0, σ₂²=1.0, x0=zeros(2))
Sample a bivariate Gaussian with mean ``\\mu`` and variance-covariance ``\\begin{bmatrix} \\sigma_1^2 & \\rho \\\\ \\rho & \\sigma_2^2 \\end{bmatrix}`` by Gibbs sampling.
### Note
This only works for a bivariate Gaussian target distribution; it is not a general sampling algorithm
### Examples
gibbs\\_xs = gibbs\\_sampling(4000)
chain\\_gibbs = Chains(gibbs_xs')
"""
function gibbs_sampling(mc=500; μ= zeros(2), ρ=0.9, σ₁² = 1.0, σ₂²=1.0, x0=zeros(2))
samples = zeros(2, mc)
x₁, x₂ = x0[1], x0[2]
σ₁, σ₂ = sqrt(σ₁²), sqrt(σ₂²)
k₁ = ρ * σ₂ / σ₁
k₂ = ρ * σ₁ / σ₂
σ₁₂ = σ₂ * sqrt(1 - ρ^2)
σ₂₁ = σ₁ * sqrt(1 - ρ^2)
@inbounds samples[:, 1] = x0
for i in 2:mc
if i % 2 == 0
x₁ = rand(Normal(μ[1] + k₁ * (x₂ - μ[2]), σ₁₂))
else
x₂ = rand(Normal(μ[2] + k₂ * (x₁ - μ[1]), σ₂₁))
end
@inbounds samples[:, i] = [x₁, x₂]
end
return samples
end
end
hmc_sampling
begin
"""
hmc_sampling(ℓπ, ∇ℓπ, mc, x₀; ϵ, Τrange =[100,200])
Sample a target probability distribution by Hamiltonian Monte Carlo sampler (HMC)
### Input
- `ℓπ` -- log pdf of the target distribution
- `∇ℓ` -- gradient of the log pdf
- `mc` -- the number of samples to draw
- `x₀` -- the initial starting point
- `ϵ` -- the Leapfrog step size
- `Trange` -- the minimum and maximum number of Leapfrog steps
### Output
- `samples` -- the `mc` iterations of samples, a `dim` × `mc` array
- `accpt_rate` -- acceptance rate of the HMC
### Examples
```julia
ℓπ = (x) -> logpdf(MvNormal([0,0], [1.0 0.9; 0.9 1.0]), x)
∇ℓπ = (x) -> ForwardDiff.gradient(ℓπ, x)
hmc_sampling(ℓπ, ∇ℓπ, 1000, randn(2); ϵ= 0.05, Τrange = [100,200])
```
"""
function hmc_sampling(ℓ, ∇ℓ, mc, x₀; ϵ, Τrange =[100,200])
dim = length(x₀)
samples = zeros(dim, mc)
proposed = zeros(dim, mc-1)
acceptance = Array{Bool, 1}(undef, mc-1)
fill!(acceptance, false)
samples[:, 1] = x₀
g₀ = -1 * ∇ℓ(x₀)
E₀ = -1 * ℓ(x₀)
n_accept = 0
for i in 2:mc
# initial momentum is Normal(0, 1)
p = randn(dim)
# evaluate old H(x₀, p) = E(x₀) + K(p)
# where E(x₀) = - ℓ(x₀)
H = (p' * p)/2 + E₀
xnew, gnew = x₀, g₀
# make Τ leapfrog simulation steps
Τ = rand(Τrange[1]:Τrange[2])
# optional
ϵ = rand([-1, 1]) * ϵ
for τ in 1:Τ
# make half-step in p
p = p - ϵ * gnew /2
# make step in x
xnew = xnew + ϵ * p
# find new gradient
gnew = -1 * ∇ℓ(xnew)
# make half-step in p
p = p - ϵ * gnew /2
end
# find new value of E and then H
proposed[:, i-1] = xnew
Enew = -1 * ℓ(xnew)
Hnew = (p' * p)/2 + Enew
dH = Hnew - H
if dH < 0
accept = true
elseif rand() < exp(-dH)
accept = true
else
accept = false
end
if accept
x₀, g₀, E₀ = xnew, gnew, Enew
n_accept = n_accept + 1
end
samples[:, i] = x₀
acceptance[i-1] = accept
end
return samples, proposed, acceptance, n_accept/(mc-1)
end
end
begin
function hamiltonian_proposal_step(ℓπ, ∇ℓπ, x₀, ϵ, T)
dim = length(x₀)
g₀ = -1 * ∇ℓπ(x₀)
E₀ = -1 * ℓπ(x₀)
# initial momentum is Normal(0, 1)
p = randn(dim)
# evaluate old H(x₀, p) = E(x₀) + K(p)
# where E(x₀) = - ℓπ(x₀)
H = (p' * p)/2 + E₀
xnew, ∇new = x₀, g₀
xs = zeros(dim, T+1)
xs[:, 1] = xnew
# make Τ leapfrog simulation steps
# Τ = rand(Τrange[1]:Τrange[2])
# optional
# ϵ = rand([-1, 1]) * ϵ
for τ in 1:T
# make half-step in p
p = p - ϵ * ∇new /2
# make a step in x based on the speed p
xnew = xnew + ϵ * p
# find new gradient
∇new = -1 * ∇ℓπ(xnew)
# make half-step in p
p = p - ϵ * ∇new /2
xs[:, τ+1] = xnew
end
return xs
end
end
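A quick sanity check of the leapfrog scheme used in both functions above: for a one-dimensional standard Gaussian target the potential energy is $E(x) = x^2/2$, and the integrator should (almost) conserve the total energy $H(x, p) = E(x) + p^2/2$ along the simulated trajectory. The following sketch is self-contained and uses only base Julia:

```julia
# Potential energy E(x) = x^2/2 for a 1-D standard Gaussian target,
# so the gradient is simply ∇E(x) = x.
∇E(x) = x

function leapfrog(x, p, ϵ, T)
    for _ in 1:T
        p -= ϵ * ∇E(x) / 2   # half-step in momentum
        x += ϵ * p           # full step in position
        p -= ϵ * ∇E(x) / 2   # half-step in momentum
    end
    return x, p
end

# Total energy H(x, p) = E(x) + K(p)
H(x, p) = x^2 / 2 + p^2 / 2

x₀, p₀ = 1.0, 0.5
x₁, p₁ = leapfrog(x₀, p₀, 0.05, 100)
ΔH = abs(H(x₁, p₁) - H(x₀, p₀))   # near zero: leapfrog almost conserves H
```

The small energy drift `ΔH` is what makes HMC proposals accepted with high probability in the Metropolis step, even after long trajectories.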
Code for referencing equations
js(x) = HypertextLiteral.JavaScript(x)
"""
`texeq(code::String)`
Takes an input string and renders it inside a numbered equation environment using KaTeX.
Equations can be given labels by adding `"\\\\label{name}"` inside the `code` string, and can subsequently be referenced in other cells using `eqref("name")`.
### Note
Unfortunately, backslashes have to be doubled when creating the TeX code to be put inside the equation.
Once Pluto supports interpretation of string literal macros, this could be made into a macro.
"""
function texeq(code,env="equation")
code_escaped = code |>
x -> replace(x,"\\" => "\\\\") |>
x -> replace(x,"\n" => " ")
@htl """
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/katex@0.13.11/dist/katex.min.css" integrity="sha384-Um5gpz1odJg5Z4HAmzPtgZKdTBHZdw8S29IecapCSB31ligYPhHQZMIlWLYQGVoc" crossorigin="anonymous">
<script src="https://cdn.jsdelivr.net/npm/katex@0.13.11/dist/katex.min.js" integrity="sha384-YNHdsYkH6gMx9y3mRkmcJ2mFUjTd0qNQQvY9VYZgQd7DcN7env35GzlmFaZ23JGp" crossorigin="anonymous"></script>
<script>
katex.render('\\\\begin{$(js(env))} $(js(code_escaped)) \\\\end{$(js(env))}',currentScript.parentElement,{
displayMode: true,
trust: context => [
'\\\\htmlId',
'\\\\href'
].includes(context.command),
macros: {
"\\\\label": "\\\\htmlId{#1}{}"
},
})
</script>
"""
end
"""
`eqref(label::String)`
Function that creates a hyperlink pointing to an equation previously defined and labelled using `texeq()`.
"""
eqref(label) = @htl """
<a eq_id="$label" id="eqref_$label" href="#$label" class="eq_href">(?)</a>
"""
@htl """
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/katex@0.13.11/dist/katex.min.css" integrity="sha384-Um5gpz1odJg5Z4HAmzPtgZKdTBHZdw8S29IecapCSB31ligYPhHQZMIlWLYQGVoc" crossorigin="anonymous">
<style>
a.eq_href {
text-decoration: none;
}
</style>
<script src="https://cdn.jsdelivr.net/npm/katex@0.13.11/dist/katex.min.js" integrity="sha384-YNHdsYkH6gMx9y3mRkmcJ2mFUjTd0qNQQvY9VYZgQd7DcN7env35GzlmFaZ23JGp" crossorigin="anonymous"></script>
<script id="katex-eqnum-script">
const a_vec = [] // This will hold the list of a tags with custom click, used for cleaning listeners up upon invalidation
const eqrefClick = (e) => {
e.preventDefault() // This prevent normal scrolling to link
const a = e.target
const eq_id = a.getAttribute('eq_id')
window.location.hash = 'eqref-' + eq_id // This is to be able to use the back function to resume previous view, 'eqref-' is added in front to avoid the viewport actually going to the equation without having control of the scroll
const eq = document.getElementById(eq_id)
eq.scrollIntoView({
behavior: 'smooth',
block: 'center',
})
}
const checkCounter = (item, i) => {
return item.classList.contains('enclosing') ?
i :
i + 1
}
const updateCallback = () => {
a_vec.splice(0,a_vec.length) // Reset the array
const eqs = document.querySelectorAll('span.enclosing, span.eqn-num')
let i = 0;
eqs.forEach(item => {
i = checkCounter(item,i)
console.log('item',i,'=',item)
if (item.classList.contains('enclosing')) {
const id = item.id
const a_vals = document.querySelectorAll(`[eq_id=\${id}]`)
a_vals !== null && a_vals.forEach(a => {
a_vec.push(a) // Add this to the vector
a.innerText = `(\${i+1})`
a.addEventListener('click',eqrefClick)
})
}
})
}
const notebook = document.querySelector("pluto-notebook")
// We have a mutationobserver for each cell:
const observers = {
current: [],
}
const createCellObservers = () => {
observers.current.forEach((o) => o.disconnect())
observers.current = Array.from(notebook.querySelectorAll("pluto-cell")).map(el => {
const o = new MutationObserver(updateCallback)
o.observe(el, {attributeFilter: ["class"]})
return o
})
}
createCellObservers()
// And one for the notebook's child list, which updates our cell observers:
const notebookObserver = new MutationObserver(() => {
updateCallback()
createCellObservers()
})
notebookObserver.observe(notebook, {childList: true})
invalidation.then(() => {
notebookObserver.disconnect()
observers.current.forEach((o) => o.disconnect())
a_vec.forEach(a => a.removeEventListener('click',eqrefClick))
})
</script>
"""
Foldable details
begin
begin
struct Foldable{C}
title::String
content::C
end
function Base.show(io::IO, mime::MIME"text/html", fld::Foldable)
write(io,"<details><summary>$(fld.title)</summary><p>")
show(io, mime, fld.content)
write(io,"</p></details>")
end
end
end
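The `Foldable` type wraps its content in an HTML `<details>` element, so Pluto renders it as a collapsible section. A self-contained sketch (re-stating the struct so the snippet runs on its own) that renders one instance to a string:

```julia
# Re-stated here so the example is self-contained.
struct Foldable{C}
    title::String
    content::C
end

function Base.show(io::IO, mime::MIME"text/html", fld::Foldable)
    write(io, "<details><summary>$(fld.title)</summary><p>")
    show(io, mime, fld.content)
    write(io, "</p></details>")
end

# Render a foldable hint into an HTML string; `HTML` is Base's
# wrapper type for raw HTML content.
buf = IOBuffer()
show(buf, MIME"text/html"(), Foldable("Hint", HTML("Apply Bayes' rule.")))
html_out = String(take!(buf))   # starts with "<details><summary>Hint</summary><p>"
```

Because `show` dispatches on the content's type, anything with a `text/html` `show` method (Markdown, plots, nested `Foldable`s) can be placed inside.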